Data Visualization Tutorial - Part 2
This is the second part of the introduction to data visualization (in R).
In this practical we will learn:
- How to build plots from simpler to more complex
- How to communicate a story with annotations
- How to combine plots to convey a story
We will be using ggplot2 to plot our data. Let’s make sure you have that installed.
Similarly to the previous practical, we will use the data from the Baker 2016 paper.
# Load in our data:
baker2016 = read.table('baker2016_artificial.txt', header = T)| job | familiar_reproducibility | significance_crisis | proportion_reproducible |
|---|---|---|---|
| PhD Student | Fairly familiar | There is a slight crisis of reproducibility | 70% |
| Student | Very familiar | There is a significant crisis of reproducibility | 50% |
| PhD Student | Fairly familiar | There is a significant crisis of reproducibility | 60% |
| Student | Very familiar | There is a slight crisis of reproducibility | 20% |
| PhD Student | Very familiar | There is a slight crisis of reproducibility | 90% |
| PhD Student | Very familiar | There is a slight crisis of reproducibility | 80% |
What is your story?
In the first practical we noticed that almost the same proportion of participants reported that they were familiar/not familiar with the term ‘Open Science’. The participants in the study came from different fields. One possibility is that the dichotomy in familiarity may be related to how much participants trust the work in their field to be reproducible (e.g. less trust -> more familiarity). We show that this is not the case and that confidence in reproducibiity is generally high amongst all the fields.
1.Build plots from simpler to more complex
Histogram for 1D data (e.g. counts)
Let’s plot again the histogram showing the overall familiarity with Open Science
We use color to highlight. You can add a conditional argument to the color or fill options to highlight certain aspects of your plot i.e. we may want to highlight the participants who responded saying either they weren’t familiar enough with the term Open Science or reasonably familiar to highlight the dichotomoy.
In order to highlight these aspects, we add the conditional argument ifelse(flagged %in% c('A reasonable amount','Not enough'),1,2) to our color option. Note %in% is appropriate here instead of == since we the conditional statements includes more than one condition. Wrapping it in as.factor() simply makes is a discrete variable, which would allow us to choose a discrete colormap.
Also, we would like plot the results from ‘Too much’ to ‘Not enough’ familiarity. This is achieved through the command factor() by specifying the desired order of levels and setting ordered = TRUE.
#some labels are way too long
baker2016 <- baker2016 %>%
mutate(field = str_replace(field,"Astronomy and planetary science","Astronomy")) %>%
mutate(field = str_replace(field,"Earth and Environmental Science","Earth Sciences")) %>%
mutate(flagged = str_replace(flagged,"A reasonable amount","Fair")) %>%
mutate(flagged = str_replace(flagged,"I am unsure","Unsure")) %>%
mutate(flagged = str_replace(flagged,"Not enough","Little")) %>%
mutate(flagged = str_replace(flagged,"Too much","Large"))
baker2016$flagged <- factor(baker2016$flagged,levels = c('Large','Fair','Unsure', 'Little'), ordered = TRUE)
hist.familiarity <- ggplot(baker2016, aes(x= flagged, fill = as.factor(ifelse(flagged %in% c('Fair','Little'),1,2)))) +
geom_bar()+
xlab('Response')+
ylab('Count')
#plot it
hist.familiarity Scatterplot - 2D data: Plot a relationship between variables
We want to check whether the dichotomy in high and low familiarity with Open Science is somehow a reflection of different levels of trust between fields.
First of all, we cannot use raw count numbers because they differ between fields. Also, we need a more ‘compact’ measure to summmarise trust and familiarity.
An easy solution, is to calculate a ratio for each field between those who are familiar/unfamiliar with open science as well as a ratio between the number of participants who trust (>50% reproducibility) or not trust (<50% reproducibility) their field. This is accomplished within the first lines of code where two new columns (ratio.Familiarity and ratio.Trust) are created to accommodate the ratios. The pipe %>% operator simply indicates that the argument on the left (or its outcome) is the input to the argument on the right. group_by(field) allow to execute the operations in mutate() (i.e. creation of a new column) separately for each level of field. For more information you can have a look at the ‘tidyverse’ web-page.
We use a scatterplot and a linear regression to visualise the relationship between the ratios.
#calculate ratio between confidence/non confidence on general knowledge of reproducibility
baker2016 <- baker2016 %>% group_by(field) %>% mutate(ratio.Familiarity = sum(flagged == 'Fair')/sum(flagged == 'Little'))
#calculate ratio between confidence/non confidence on reproducibility in their own field
baker2016 <- baker2016 %>% group_by(field) %>% mutate(ratio.Trust = sum(as.numeric(sub("%", "",proportion_reproducible)) >50)/sum(as.numeric(sub("%", "",proportion_reproducible)) <51))
#plot
scatterPlot <- ggplot(baker2016,aes(ratio.Trust, ratio.Familiarity, color=field)) +
# show data points with size 3
geom_point(size = 3) +
# set x/y axes limits
xlim(0,5)+ylim(0,5)+
# we want to see a linear regression (this is for illustrative purposes, analyses should be done aside)
geom_smooth(method = "lm", se=FALSE, color="black", formula = y ~ x)+
# use a minimalist theme to remove clutter (e.g. background color, axes)
theme_minimal()
#plot it
scatterPlotHeatmap - 3D data: Plot a relationship between variables
In our exploratory visualization we have used density plots to visualise how the distributions of values of trust in reproducibility differ between the fields
ggplot(baker2016, aes(group= field, fill=field, x = proportion_reproducible)) +
geom_histogram(stat = 'density') +
facet_wrap(~field)## Warning: Ignoring unknown parameters: binwidth, bins, pad
That was useful for us to get an idea of how the data were distributed. However now we want to convey the message that there are not major differences between fields in that they all share a high level of trust. For this purpose, we use a more compact visualisation using a heatmap.
Note, we first summarise the data using proportions as we are interested in relative values as opposed to raw counts.
# create a new dataframe df
df <- baker2016 %>%
#select only two columns
select(c('proportion_reproducible','field')) %>%
#perform the analyses on each level of proportion_reproducible by field
group_by(field,proportion_reproducible) %>%
#count the items for each level
summarise(n = n()) %>%
#calculate proportion for each level
mutate(freq = n/sum(n))
#order factors
df$proportion_reproducible <- factor(df$proportion_reproducible,levels = paste0(seq(0,100,10),'%'), ordered = TRUE)#Assign color variables
col1 = "#d8e1cf"
col2 = "#438484"
heatmap <- ggplot(df, aes(proportion_reproducible, field)) +
#select the plot type and filling, separation lines = white
geom_tile(aes(fill = freq),colour = "white", na.rm = TRUE) +
#use the assigned colors above for the filling
scale_fill_gradient(low = col1, high = col2) +
theme_bw() + theme_minimal() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
heatmap2.Communicate a story with annotations
We will use {ggforce} to annotate our plots
library(ggforce)Let’s remove the legend and add a title and subtitle
hist.familiarity <- hist.familiarity +
theme_minimal()+
#we do not need a legend as colors are to attract the attention
theme(legend.position = 'none')+
ggtitle('"Open Science" familiarity')
hist.familiarity The scatter plot looks quite complex and needs some extra information to become completely clear. We will add text to indicate the meaning of illustrative data points using geom_mark_ellipse from {ggforce}. The filter = field == 'Astronomy and planetary science' option indicates which data point(s) need to be highlighted.
We also add a title and subtitle to convey the main message of the plot
scatterPlot<- scatterPlot+
#add text to make the figure more understandable at first glance
geom_mark_ellipse(aes(filter = field == 'Astronomy', label = 'Astronomy',
description = 'Low familiarity with Open Science but high trust in the field '),label.fontsize = 8) +
geom_mark_ellipse(aes(filter = field == 'Medicine', label = 'Medicine',
description = 'Low familiarity with Open Science and low trust in the field '),label.fontsize = 8) +
#Tidy up labels for readability
xlab('Trust (ratio)') + ylab('Familiarity (ratio)') +
#Add title and subtitle so that readers have immediate understanding of the plot
ggtitle("Familiarity with Open Science vs Trust in the field",subtitle = 'Linear regression (black line) shows no effect')
# plot it
scatterPlotFinally we will add title and subtitle to the heatmap and
heatmap<- heatmap+
#add legend title
guides(fill=guide_legend(title="Prop"))+
#add title and xy labels)
labs(title = "Trust in reproducibility", x = "Percentage reproducibility", y = "Field") +
# rotate xtick labels to avoid orverlap when trying to combine images
easy_rotate_x_labels(side = "right")
#plot it
heatmap3.combine plots to convey a story
library(patchwork)We will use {patchwork} to combine plots together into a figure. An alternative package is {Cowplot}.
“Patchwork offers an intuitive syntax”
“Dimensions and layouts are easily customised”
hist.familiarity + heatmapAdd tag to plots
hist.familiarity + theme(plot.margin = unit(c(30,30,30,0), "pt"))+heatmap + plot_annotation(tag_levels = 'A') + plot_annotation(title = 'Familiarity with Open Science and Trust in preproducibility',
subtitle = 'High discrepancy in familiarity but overall high trust in reproducibility',
caption = 'Disclaimer: None of these plots are insightful')scatterPlot+plot_layout(guides = 'collect') + plot_annotation(tag_levels = c('A','1'), tag_prefix = 'Fig. ', tag_sep = '.', tag_suffix = ':') & theme(plot.tag=element_text(size = 10)) Combine altogether. plot_layout(guides = 'collect') sometimes is needed to align the legends. theme(plot.margin = unit(c(0,60,0,0), "pt")) is to add some space between the plots. It will require a bit of fiddling to obtain adjust the figures (e.g. for readibility of percentages)
scatterPlot / (hist.familiarity + heatmap)Further Resources
This website could be considered the bible of plotting data in R. This ggplot cheatsheet is great to have at hand. Here you can find info on how to control layouts with patchwork.